Utterance Selection Model of Language Change 



in 
o 
o 

(N 
o 

Q 

(N 
(N 



43 

O 



CO 



I 

C 

O 

o 



> 

00 
00 

in 

in 
o 

-I— » 
c3 



-a 
c 

o 
o 



X 



G. J. Baxter, 1 R. A. Blythe, 1 - 2 W. Croft, 3 and A. J. McKane 1 

1 School of Physics and Astronomy, University of Manchester, Manchester M13 9PL, U.K. 
^School of Physics, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, U.K. 
3 School of Languages, Linguistics and Cultures, University of Manchester, Manchester M13 9PL, U.K. 

(Dated: February 3, 2008) 

We present a mathematical formulation of a theory of language change. The theory is evolutionary 
in nature and has close analogies with theories of population genetics. The mathematical structure 
we construct similarly has correspondences with the Fisher- Wright model of population genetics, 
but there are significant differences. The continuous time formulation of the model is expressed in 
terms of a Fokker-Planck equation. This equation is exactly soluble in the case of a single speaker 
and can be investigated analytically in the case of multiple speakers who communicate equally with 
all other speakers and give their utterances equal weight. Whilst the stationary properties of this 
system have much in common with the single-speaker case, time-dependent properties are richer. 
In the particular case where linguistic forms can become extinct, we find that the presence of many 
speakers causes a two-stage relaxation, the first being a common marginal distribution that persists 
for a long time as a consequence of ultimate extinction being due to rare fluctuations. 
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I. INTRODUCTION 

Stochastic many-body processes have long been of in- 
terest to physicists, largely from applications in con- 
densed matter and chemical physics, such as surface 
growth, the aggregation of structures, reaction dynam- 
ics or pattern formation in systems far from equilib- 
rium. Through these studies, statistical physicists have 
acquired a range of analytical and numerical techniques 
along with insights into the macroscopic phenomena that 
arise as a consequence of noise in the dynamics. It is 
therefore not surprising that physicists have begun to 
use these methods to explore emergent phenomena in 
the wider class of complex systems which — in addition to 
stochastic interactions — might invoke a selection mecha- 
nism. In particular, this can lead to a system adapting 
to its environment. 

The best-known process in which selection plays an 
important part is, of course, biological evolution. More 
generally, one can define an evolutionary dynamics as 
being the interplay between three processes. In addi- 
tion to selection, one requires replication (e.g., of genes) 
to sustain a population and variation (e.g., mutation) 
so that there is something to select on. A generalized 
evolutionary theory has been formalized by biologist and 
philosopher of science David Hull [1, 2] that includes as 
special cases both biological and cultural evolution. The 
latter of these describes, for example, the propagation 
of ideas and theories through the scientific community, 
with those theories that are "fittest" (perhaps by pre- 
dicting the widest range of experimental results) having a 
greater chance of survival. Within this generalized evolu- 
tionary framework, a theory of language change has been 
developed [3-5] which we examine from the point of view 
of statistical physics in this paper. 

Since it is unlikely that the reader versed in statis- 
tical physics is also an expert in linguistics, we spend 



some time in the next section outlining this theory of 
language change. Then, our formulation of a very simple 
mathematical model of language change that we define 
in Sec. Ill should seem rather natural. As this is not the 
only evolutionary approach that has been taken to the 
problem of language change, we provide — again, for the 
nonspecialist reader — a brief overview of relevant model- 
ing work one can find in the literature. The remainder 
of this paper is then devoted to a mathematical analysis 
of our model. 

A particular feature of this model is that all speak- 
ers continuously vary their speech patterns according 
to utterances they hear from other speakers. Since in 
our model, the utterances produced represent a finite- 
sized sample of an underlying distribution, the language 
changes over time even in the absence of an explicit selec- 
tion mechanism. This process is similar to genetic drift 
that occurs in biological populations when the individ- 
uals chosen to produce offspring in the next generation 
are chosen entirely at random. Our model also allows for 
language change by selection as well as drift (see Sec. III). 
For this reason, we describe the model as the "utterance 
selection model" [3]. 

As it happens, the mathematics of our model of lan- 
guage change turn out to be almost identical to those de- 
scribing classical models in population genetics. This we 
discover from a Fokker-Planck equation for the evolution 
of the language, the derivation of which is given in Sec. V. 
Consequently, we have surveyed the existing literature on 
these models, and by doing so obtained a number of new 
results which we outline in Sec. VII and whose detailed 
derivation can be found elsewhere [6]. Since in the lan- 
guage context, these results pertain to the rather limiting 
case of a single speaker — which is nevertheless nontrivial 
because speakers monitor their own language use — we ex- 
tend this in Sec. VIII to a wider speech community. In all 
cases we concentrate on properties indicative of change, 
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such as the probability that certain forms of language fall 
into disuse, or the time it takes for them to do so. Es- 
tablishing these basic facts is an important step towards 
realizing our future aims of making a meaningful com- 
parison with observational data. We outline such scope 
for future work and discuss our results in the concluding 
section. 



II. LANGUAGE CHANGE AS AN 
EVOLUTIONARY PROCESS 

In order to model language change we focus on lin- 
guistic variables, which are essentially "different ways of 
saying the same thing" . Examples include the pronunci- 
ation of a vowel sound, or an ordering of words according 
to their function in the sentence. In order to recognize 
change when it occurs, we will track the frequencies with 
which distinct variants of a particular linguistic variable 
are reproduced in utterances by a language's speakers. 
Let us assume that amongst a given group of speakers, 
one particular variant form is reproduced with a high 
frequency. This variant we shall refer to as the con- 
vention among that group of speakers. Now, it may be 
that, over time, an unconventional — possibly completely 
new — variant becomes more widely used amongst this 
group of speakers. Clearly one possibility here is that by 
becoming the most frequently used variant, it is estab- 
lished as the new convention at the expense of the exist- 
ing one. It is this competition between variant forms, and 
particularly the propagation of innovative forms across 
the speech community, that we arc interested in. 

We have so far two important ingredients in this pic- 
ture of language change: the speakers, and the utterances 
they produce. The object relating a speaker to her [47] 
utterances we call a grammar. More precisely, a speaker's 
grammar contains the entirety of her knowledge of the 
language. We assume this to depend on the frequencies 
she has heard particular variant forms used within her 
speech community [7, 8] . In turn, grammars govern the 
variants that are uttered by speakers, and how often. 

Clearly, a "real-world" grammar must be an extremely 
complicated object, encompassing a knowledge of many 
linguistic variables, their variant forms and their suit- 
ability for a particular purpose. However, it is noticed 
that even competent speakers (i.e., those who are highly 
aware of the various conventions among different groups) 
might use unconventional variants if they have become 
entrenched [3]. For example, someone who has lived for 
a long time in one region may continue to use parts of the 
dialect of that region after moving to a completely new 
area. This fact will impact on our modeling in two ways. 
First, we shall assume that a given interaction (conver- 
sation) between two speakers has only a small effect on 
the established grammar. Second, speakers will reinforce 
their own way of using language by keeping a record of 
their own utterances. 

Another observed feature of language use is that there 



is considerable variation, not just from speaker to speaker 
but also in the utterances of a single speaker. There are 
various proposals for the origin of this variation. On the 
one hand, there is evidence for certain variants to be fa- 
vored due to universal forces of language change. For in- 
stance articulatory and acoustic properties of sounds, or 
syntactic processing factors — which are presumed com- 
mon to all speakers — favor certain phonetic or syntactic 
changes over others [9, 10]. These universals can be rec- 
ognized through a high frequency of such changes occur- 
ring across many speech communities. 

On the other hand, variation could reflect the wide 
range of possible intentions a speaker could have in com- 
municative enterprise. For example, a particular non- 
conventional choice of variant might arise from the de- 
sire not to be misunderstood, or to impress, flatter or 
amuse the listener [11]. Nevertheless, in a recent analy- 
sis of language use with a common goal [12], it was ob- 
served that variation is present in nearly all utterances. 
It seems likely, therefore, that variation arises primar- 
ily as a consequence of the fact that no two situations 
are exactly alike, nor do speakers construe a particular 
situation in exactly the same way. Hence there is a fun- 
damental indeterminacy to the communicative process. 
As a result, speakers produce variant forms for the same 
meaning being communicated. These forms are words or 
constructions representing possibly novel combinations, 
and occasionally completely novel utterances. Given the 
large number of possible sources of variation and innova- 
tion, we feel it appropriate to model these effects using a 
stochastic prescription. 

In order to complete the evolutionary description, we 
require a mechanism that selects an innovative variant for 
subsequent propagation across the speech community. In 
the theory of Ref. [3] it is proposed that social forces play 
this role. This is based on the observation that speakers 
want to identify with certain subgroups of a society, and 
do so in part by preferentially producing the variants 
produced by members of the emulated subgroup [13, 14]. 
That is, the preference of speakers to produce variants 
associated with certain social groups acts as a selection 
mechanism for those variants. 

This particular evolutionary picture of language 
change (see Sec. IV for contrasting approaches) places 
an emphasis on utterances (perhaps more so than on the 
speakers). Indeed, in Ref. [3] the utterance is taken as 
the linguistic analog of DNA. As speakers reproduce ut- 
terances, linguistic structures get passed on from gener- 
ation to generation (which one might define as a partic- 
ular time interval). For this reason, the term lingueme 
has been coined in [3] to refer to these structures, and 
to emphasize the analogy with genetics. One can then 
extend to analogy to identify linguistic variables with a 
particular gene locus and variant forms with alleles. 

We stress, however, that the analogy between this evo- 
lutionary formulation of language change and biological 
evolution is not exact. The distinction is particularly 
clear when one views the two theories in the more general 
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framework of Hull [1, 2, 4], The two relevant concepts 
are interactors and replicators whose roles are played in 
the biological system by individual organisms and genes 
respectively. In biology, a replicator (a gene) "belongs 
to" an interactor (an organism), thereby influencing its 
survival and reproductive ability of the interactor. This 
is then taken as the dominant force governing the make- 
up of the population of replicators in the next generation. 
The survivability of a replicator is not due to an inher- 
ent "fitness" : it is the organism whose fitness leads to the 
differential survival or extinction of replicators. Also, the 
relationship between genotype and phenotype is indirect 
and complex. Nevertheless, there is a sufficient correla- 
tion between genes and phenotypic traits of organisms 
such that the differential survival of the latter causes the 
differential survival of the former (this is Hull's definition 
of "selection"), but the correlation is not a simple one. 

In the linguistic theory outlined here, the interactors 
(speakers) and replicators (linguemes) have quite differ- 
ent relationships to one another. The replicators are ut- 
tered by speakers, and there is no one-to-one relationship 
between a replicator (a lingueme) and the speaker who 
produces it. Nevertheless, Hull's generalized theory of 
selection can be applied to the lingueme as replicator 
and the speaker as interactor. Linguemes and lingueme 
variation is generated by speaker intercourse, just as new 
genotypes are generated by sexual intercourse. The gen- 
eration process is replication, that is, speakers are repli- 
cating sounds, words and construction they have heard 
before. Finally, the differential survival of the speak- 
ers, that is, their social "success" , causes the differential 
survival of the linguemes they produce, and so the so- 
cial mechanisms underlying the propagation of linguistic 
variants conforms to Hull's definition of selection. 

In short, we do not suppose that the language uttered 
by an interactor has any effect on its survival, believing 
the dominant effects on language change to be social in 
origin. That is, the survivability of a replicator is not due 
to any inherent fitness, but arises instead from the social 
standing of individuals associated with the use of the 
corresponding variant form. It is therefore necessary that 
in formulating a mathematical model of language change, 
one should not simply adapt an existing biological theory, 
but start from first principles. This is the program we 
now follow. 



III. DEFINITION OF THE UTTERANCE 
SELECTION MODEL 

The utterance selection model comprises a set of rules 
that govern the evolution of the simplest possible lan- 
guage viewed from the perspective of the previous sec- 
tion. This language has a single lingueme with a re- 
stricted number V > 2 variant forms. At present we 
simply assume the existence of multiple variants of a 
lingueme: modeling the the communicative process and 
the means by which indeterminacy in communication (see 




FIG. 1: Speakers in the society interact with different fre- 
quencies (shown here schematically by different thicknesses 
of lines connecting them). The pair of speakers i,j is chosen 
to interact with probability Gij. 



Sec. II) leads to the generation of variation is left for fu- 
ture work. 

In the speech community we have N individuals, each 
of whose knowledge of the language — the grammar — is 
encoded in the set X(t) of variables Xi v (t). In a manner 
shortly to be defined precisely, the variable Xi V {t) reflects 
speaker z's (1 < i < N) perception of the frequency with 
which lingueme variant v (1 < v < V) is used in the 
speech community at time t. At all times these variables 
are normalized so that the sum over all variants for each 
speaker is unity, that is 

v 

53x to (t) = lVi > *. (1) 

v=l 

For convenience we will sometimes use a vector notation 
%i = {xi\, ■ ■ ■ ,Xiv) to denote the entirety of speaker i's 
grammar. The state of the system X(t) at time t is then 
the aggregation of grammars X(t) = (x\{t), . . . ,x^(t)). 

After choosing some initial condition (e.g., a random 
initial condition), we allow the system to evolve by re- 
peatedly iterating the following three steps in sequence, 
each iteration having duration St. 

1. Social interaction. A pair i,j of speakers is chosen 
with a (prescribed) probability Gij. There is no notion 
of an ordering of a particular pair of speakers in this 
model, and so we implicitly have Gy = Gji, normalized 
such that the sum over distinct pairs Yl/j j\ Gij — 1. See 
Fig. 1. 

2. Reproduction. Both the speakers selected in step 
1 produce a set of T tokens, i.e., instances of lingueme 
variants. Each token is produced independently and at 
random, with the probability that speaker i utters variant 
v equal to the production probability x' iv (t) which will be 
determined in one of two ways (see below). The numbers 
of tokens nu (t), . . . , niv (t) of each variant are then drawn 



FIG. 2: Both speakers i and j produce an utterance, with par- 
ticular lingueme variants appearing with a frequency given by 
the value stored in the utterer's grammar when no production 
biases are in operation. In this particular case three variants 
are shown (a, (5 and 7) and the number of tokens, T, is equal 
to 6. 



from the multinomial distribution 



P(fi i ,x' i ) = 



T 



■ n iV 



( x il, 



(2) 



where x{ = (x' a , . . . , x' iv ), r? 4 = . . . , n iv ), 

S^=i n ™ = T , and where we have dropped the explicit 
time dependence to lighten the notation. Speaker j pro- 
duces a sequence of tokens according to the same pre- 
scription, with the obvious replacement i — ► j. The ran- 
domness in this step is intended to model the observed 
variation in language use that was described in the pre- 
vious section. 

The first, and simplest possible prescription for ob- 
taining the reproduction probabilities is simply to assign 
x 'iv W = x ™ (t) ■ Since the grammar is a function of the 
speaker's experience of language use — the next step ex- 
plains precisely how — this reproduction rule does not in- 
voke any favoritism towards any particular variants on 
behalf of the speaker. We therefore refer to this case as 
unbiased reproduction, depicted in Fig. 2. 

We shall also study a biased reproduction model, illus- 
trated in Fig. 3. Here, the reproduction probabilities are 
a linear transformation of the grammar frequencies, i.e., 
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^ M vw a 



,(t) 



(3) 



in which the matrix M must have column sums of unity 
so that the production probabilities are properly normal- 
ized. This matrix M is common to all speakers, which 
would be appropriate if one is considering effects of uni- 
versal forces (such as articulatory considerations) on lan- 
guage. Furthermore, in contrast to the unbiased case, 
this reproduction model admits the possibility of inno- 
vation, i.e., the production of variants that appear with 
zero frequency in a speaker's grammar. 

3. Retention. The final step is to modify each speaker's 
grammar to reflect the actual language used in the course 
of the interaction. The simplest approach here is to add 
to the existing speaker's grammar additional contribu- 
tions which reflect both the tokens produced by her and 
by her interlocutor. The weight given to these tokens, 



FIG. 3: In the biased reproduction model, the probability of 
uttering a particular variant is a linear combination M of the 
values stored in the grammar. 




FIG. 4: After the utterances have been produced, both speak- 
ers modify their grammars by adding to them the frequencies 
with which the variants were reproduced in the conversation. 
Note each speaker retains both her own utterances as well as 
those of her interlocutor, albeit with different weights. 



relative to the existing grammar, is given by a parameter 
A. Meanwhile, the weight, relative to her own utterances, 
that speaker i gives to speaker j's utterances is specified 
by Hij. This allows us to implement the social forces 
mentioned in the previous section. These considerations 
imply that 



Xi(t + St) oc 



Xi(t) + A 



T 



Hi 



' T 



(4) 



for speaker i, and the same rule for speaker j after ex- 
changing all i and j indices. Fig. 4 illustrates this step. 
The parameter A, which affects how much the grammar 
changes as a result of the interaction is intended to be 
small, for reasons given in the previous section. 

We must also ensure that the normalization (1) is 
maintained. Therefore, 



Xi(t + 5t) 



Xjjt) + I (nj(t) + Hijfjjjt)) 

l + A(l + fly) 



(5) 



Although we have couched this model in terms of the 
grammar variables X{ V (t) , we should stress that these are 
not observable quantities. Really, we should think in 
terms of the population of utterances produced in a par- 
ticular generation, e.g., a time interval At 3> 5t as in- 
dicated in Fig. 5. However, since the statistics of this 
population can be derived from the grammar variables — 
indeed, in the absence of production biases they are the 
same — we shall in the following focus on the latter. 
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FIG. 5: A generation of a population of utterances in the 
utterance selection model could be defined as the set of tokens 
produced by all speakers in the macroscopic time interval At. 



IV. COMPARISON WITH OTHER MODELS OF 
LANGUAGE CHANGE 

Evolutionary modeling has a long history in the field 
of language change and development. Indeed, at a num- 
ber of points in The Origin of the Species, Charles Dar- 
win makes parallels between the changes that occur in 
biological species and in languages. Particularly, he 
used our everyday observation that languages tend to 
change slowly and continuously over time to challenge 
the then prevailing view that biological species were dis- 
tinct species, occupying immovable points in the space 
of all possible organisms. As evolutionary theories of bi- 
ology have become more formalized, it is not surprising 
that these there have been a number of attempts to ap- 
ply more formal evolutionary ideas to language change 
(see, e.g., [15]). In this Section we describe a few of these 
studies in order that the reader can see how our approach 
differs from others one can find in the literature. 

One area in which biological evolution plays a part is 
the development of the capacity to use language (see, 
e.g., [16] for a brief overview). Although this is in it- 
self an interesting topic to study, we do not suppose that 
this (presumably) genetic evolution is strongly related to 
language change since the latter occurs on much shorter 
timescales. For example, the FOXP2 gene (which is be- 
lieved to play a role in both language production and 
comprehension) became fixed around 120,000 years ago 
[17], whereas the patterns in the use of linguistic variables 
can change over periods as short as tens of years. 

Given an ability to use language, one can ask how the 
various linguistic structures (such as particular aspects 
of grammar or syntax) come into being [18]. Here evo- 
lutionary models that place particular emphasis on lan- 
guage learning are often employed. Some aspects of this 
type of work are reviewed in [19] — here we remark that 
in order to see the emergence of grammatical rules, one 
must model a grammar at a much finer level than we have 
done here. Indeed, we have left aside the (nevertheless in- 
teresting) question of how an innovation is recognized as 
"a different way of saying the same thing" by all speakers 



in the community. Instead, we assume that this agree- 
ment is always reached, and concentrate on the fate of 
new variant forms. 

Similar kinds of assumptions have been used in a 
learning-based context by Niyogi and Berwick [20] to 
study language change. In learning-based models in gen- 
eral, the mechanism for language change lies in speakers 
at an early stage of their life having a (usually finite) set 
of possible grammars to choose from, and use the data 
presented to them by other speakers to hypothesize the 
grammar being used to generate utterances. Since these 
data are finite, there is the possibility for a child listening 
to language in use to infer a grammar that differs from his 
parents', and becomes fixed once a speaker reaches ma- 
turity. Our model of continuous grammatical change as 
a consequence of exposure to other speakers at all stages 
in a speaker's life is quite different to learning-based ap- 
proaches. In particular, it assumes an inductive model 
of language acquisition [21], in which the child entertains 
hypotheses about sets of words and grammatical con- 
structions rather than about entire discrete grammars. 
Thus, our model does not assume that a child has in her 
mind a large set of discrete grammars. 

The specific model in [20] assigns grammars (lan- 
guages) to a proportion of the population of speakers in 
a particular generation. A particular learning algorithm 
then implies a mapping of the proportions of speakers 
using a particular language from one generation to the 
next. Since one is dealing with nonlinear iterative maps, 
one can find familiar phenomena such as bifurcations 
and phase transitions [22] in the evolution of the lan- 
guage. Note, however, that the dynamics of the popula- 
tion of utterances and speakers are essentially the same in 
this model, since the only thing distinguishing speakers 
is grammar. In the utterance selection model, we have 
divorced the population dynamics of speakers and utter- 
ances, and allow the former to be distinguished in terms 
of their social interactions with other speakers (which is 
explicitly ignored in [20]). This has allowed us to take 
a fixed population of speakers without necessarily pre- 
venting the population of utterances to change. In other 
words, language change may occur if the general struc- 
ture of a society remains intact as individual speakers 
are replaced by their offspring, or even during a period 
of time when there is no change in the makeup of the 
speaker population; both of these possibilities are widely 
observed. 

An alternative approach to language change in the 
learning-based tradition is not to have speakers attempt 
to infer the grammatical rules underpinning their par- 
ents' language use, but to select a grammar based on 
how well it permits them to communicate with other 
members of the speech community. This path has been 
followed most notably by Nowak and coworkers in a se- 
ries of papers (including [23, 24]) as well as by members 
of the statistical physics community [25]. This thinking 
allows one to borrow the notion of fitness from biologi- 
cal evolutionary theories — the more people a particular 
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grammar allows you to communicate with, the fitter it 
is deemed to be. In order for language use to change, 
speakers using a more coherent grammar selectively pro- 
duce more offspring than others so that the language as a 
whole climbs a hill towards maximal coherence. The dif- 
ferences between this and our way of thinking should be 
clear from Sec. II. In particular we assume no connection 
between the language a speaker uses and her biological 
reproductive fitness. Finally on the subject of learning- 
based models, we remark that not all of them assume 
language transmission from parents to offspring. For ex- 
ample, in [26] the effects of children also learning from 
their peers are investigated. 

Perhaps closer in spirit to our own work are studies 
that have languages competing for speakers. The sim- 
plest model of this type is due to Abrams and Strogatz 
[27] which deems a language "attractive" if it is spoken by 
many speakers or has some (prescribed) added value. For 
example, one language might be of greater use in a trad- 
ing arrangement. In [27] good agreement with available 
data for the number of speakers of minority languages 
was found, revealing that the survival chances of such 
languages are typically poor. More recently, the model 
has been extended by Minett and Wang [28] to implement 
a structured society and the possibility of bilingualism. 
One might view the utterance selection model as being 
relevant here if the variant forms of a lingucme represent 
different languages. However, there are then significant 
differences in detail. First, the way the utterance selec- 
tion model is set up would imply that all languages are 
mutually intelligible to all speakers. Second, in the mod- 
els of [27, 28], learning a new language is a strategic de- 
cision whereas in the utterance selection model it would 
occur simply through exposure to individuals speaking 
that language. 

To summarize, the distinctive feature of our modeling 
approach is that we consider the dynamics of the popula- 
tion of utterances to be separate from that of the speech 
community (if indeed the latter changes at all). Fur- 
thermore, we assume that language propagates purely 
through exposure with social status being used as a se- 
lection process, rather than through some property of 
the language itself such as coherence. The purpose of 
this work is to establish an understanding of the conse- 
quences of the assumptions we have made, particularly 
in those cases where the utterance selection model can 
be solved exactly. 



V. CONTINUOUS-TIME LIMIT AND 
FOKKER-PLANCK EQUATION 

We begin our analysis of the utterance selection model 
by constructing a Fokkcr-Planck equation via an appro- 
priate continuous-time limit. There are several ways one 
could proceed here. For example, one could scale the 
interaction probabilities Gij proportional to St (the con- 
stant of proportionality then being an interaction rate). 



Whilst this would yield a perfectly acceptable continuous 
time process, the Fokkcr-Planck equation that results is 
unwieldy and intractable. Therefore we will not follow 
this path, but will discuss two other approaches below. 
The first will be applicable when the number of tokens is 
large. This will not generally be the case, but will serve 
to motivate the second approach, which is closer to the 
situation which we are modeling. 

A. The continuous time limit 

To clarify the derivation it is convenient to start with 
a single speaker which, although linguistically trivial, is 
far from mathematically trivial. It also has an impor- 
tant correspondence to population dynamics, which is 
explored in more detail in Sec. VI. In this case there is 
no matrix Hij , and in fact we can drop the indices i and 
j completely. This means that the update rule (5) takes 
the simpler form 

x(t) + £n(t) , , 

*(* + St) = K '- + T X 1 ' (6) 

and so Sx(t) = x(t + St) — x(t) is given by 

^') = TTa(^ -**))■ (7) 

The derivation of the Fokker-Planck equation involves 
the calculation of averages of powers of Sx{t). Using 
Eq. (2), the average of ft is Tx'. If we begin by assuming 
unbiased reproduction, then x' = x and so the average 
of Sx(t) is zero. In the language of stochastic dynamics, 
there is no deterministic component — the only contribu- 
tion is from the diffusion term. This is characterized by 
the second moment which is calculated in the Appendix 
to be 

A 2 1 

{Sx v (t)Sx w (t)) — — j ——j — {x v S vw x v x w ) , (8) 
U + A i 1 

where the angle brackets represent an average over all 
possible realizations. To give a contribution to the 
Fokker-Planck equation, the second moment (8) has to 
be of order St. One way to arrange this is as follows. 
We choose the unit of time such that T utterances are 
made in unit time. Thus the time interval between ut- 
terances, St = 1/T, is small if T is large. Furthermore, 
although the frequency of a particular variant in an ut- 
terance, n v /T, varies in steps, the steps are very small. 
Therefore, when T becomes very large, the time and vari- 
ant frequency steps become very small and can be ap- 
proximated as continuous variables. The second jump 
moment, which is actually what appears in the Fokker- 
Planck equation, is found by dividing the expression (8) 
by St = T _1 , and letting St -> 0: 

A 2 

OL vw {x^t) — — ^ {x v S vw x v x w ) . (9) 
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Since the higher moments of the multinomial distribution 
involve higher powers of T^ 1 = St, they give no contri- 
bution, and the only non-zero jump moment is given by 
Eq. (9). As discussed in the Appendix, or in standard 
texts on the theory of stochastic processes [29, 30], this 
gives rise to the Fokker-Planck equation 



dP(x, t) 

di 



A 2 



& 2 



2(1 + A) 2 ^ dx v dx 



~{XyS V ^yj XyXyj^jP^X , t) , 
1 

(10) 

where we have suppressed the dependence of the proba- 
bility distribution function P(x, t) on the initial state of 
the system. 

The equation (10) holds only for unbiased reproduc- 
tion. It can be generalized to biased reproduction by not- 
ing that as T — > oo, this process becomes deterministic. 
Thus Eq. (7) is replaced by the deterministic equation 



Sx 



A 



1 + A 



(x' - x) . 



(11) 



However, we may write Eq. (3) using the condition 

M wv = 1 as 



^ ^ ~M vw x w ^ ^ M- WV X V 
w w 

^ (M vw x w - M wv x v ) . (12) 



The diagonal entries of M are omitted in the last line 
because the condition J2 w M wv — 1 means that in each 
column one entry is not independent of the others. If 
we choose this entry to be the one with w = v, then 
all elements of M in Eq. (12) are independent. Thus 
the diagonal entries of M have no significance; they are 
simply given by M vv = 1 - J2 w ^v M wv From Eqs. (11) 
and (12) we see that in order to obtain a finite limit as 
St — > 0, we need to assume that the off-diagonal entries of 
M are of order St. Specifically, we define M vw — m vw 8t 
for v w. Then in the limit St ^ 0, 



dx v (t) 
dt 



A 



(1 + A) 



^ ^ ij^vwXw TFlwvXv) • (1^) 



Deterministic effects such as this give rise to 0(St) contri- 
butions in the derivation of the Fokker-Planck equation, 
unlike the O(St) 1 / 2 contributions arising from diffusion. 
Therefore, the first jump moment in the case of biased 
reproduction is given by the right-hand side of Eq. (13). 
The second jump moment is still given by Eq. (9), since 
any additional terms involving M vw are of order St and 
so give terms which vanish in the St ^ limit. This dis- 
cussion may be straightforwardly extended to the case 
of many speakers. The only novel feature is the appear- 
ance of the matrix H^ . In order to obtain a deterministic 
equation of the type (13), a new matrix has to be defined 
by Hij = h i0 St. 

Thus, in summary, what could be called the "large T 
approximation" is obtained by choosing St = T , and 



defining new matrices m and h through M vw = m vw St 
for v ^ w and Hj = hijSt. It is the classic way of 
deriving the Fokker-Planck equations as the "diffusion 
approximation" to a discrete process. However, for our 
purposes it is not a very useful approximation. This is 
simply because we do not expect that in realistic situa- 
tions the number of tokens will be large, so it would be 
useful to find another way of taking the continuous-time 
limit. Fortunately, another parameter is present in the 
model which we have not yet utilized. This is A, which 
characterizes the small effect that utterances have on the 
speaker's grammar. If we now return to the case of a 
single speaker with unbiased reproduction, we see from 
Eq. (8), that an alternative to taking T^ 1 = St, is to take 
A = {St) 1 / 2 . Thus, in this second approach, we leave T 
as a parameter in the model, and set the small parameter 
A equal to (St) 1 / 2 . The second jump moment (9) in this 
formulation is replaced by 



&vw (x, t) — ^ (x v S vw X V X W ) 



(14) 



Bias may be introduced as before, and gives rise to 
Eqs. (11) and (12). The difference in this case is that A 
has been assumed to be 0(St) 1 / 2 , and so the off-diagonal 
entries of M (and the entries of H in the case of more 
than one-speaker) have to be rescaled by (St) 1 / 2 , rather 
than St. This means that in this second approach we 
must rescale the various parameters in the model accord- 
ing to 

A = (St) 1 / 2 (15) 
M vw = m vw (St) 1/2 for v ^ w (16) 
Hj - h^St) 1 ' 2 (17) 

as St — > 0. We have found good agreement between the 
predictions obtained using this continuous-time limit and 
the output of Monte Carlo simulations when A was suffi- 
ciently small, e.g., A « 10~ 3 . 



B. The general form of the Fokker-Planck equation 

In Sec. VA we have outlined the considerations in- 
volved in deriving a Fokker-Planck equation to describe 
the process. We concluded that, for our present pur- 
poses, the scalings given by Eqs. (15)-(17) were most ap- 
propriate. Much of the discussion was framed in terms 
of a single speaker, because the essential points are al- 
ready present in this case, but here will study the full 
model. The resulting Fokker-Planck equation describes 
the time evolution of the probability distribution func- 
tion P(X, t\X o ,0) for the system to be in state X at 
time t given it was originally in state Xq, although we 
will frequently suppress the dependence on the initial 
conditions. The variables X comprise N(V — 1) inde- 
pendent grammar variables, since the grammar variable 
Xiv is determined by the normalization Y^j=i x ™ = 1- 
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The derivation of the Fokker-Planck equation is given 
in the Appendix. It contains three operators, each 
of which corresponds to a distinct dynamical process. 
Specifically, one has for the evolution of the distribution 



dP(X,t) 
Ft 



^(bias) _|_ £(rep) 



P(X,t) 



in which G, = i s the probability that speaker i 

participates in any interaction. 

The operator 



5 (bias) _ 



V-1 



d 



■0 = 1 



arises as a consequence of bias in the production prob- 
abilities. Note that the variable xiy appearing in this 
expression must be replaced by 1 — J2v=i x ™ m order 
that the resulting Fokker-Planck equation contains only 
the independent grammar variables. 

As discussed above, the finite-size sampling of the (pos- 
sibly biased) production probabilities yields the stochas- 
tic contribution 



£(rcp) _ _j_ 



V-1 V-1 



*2T dxi v dxi 

v=l w=l 



XjuXj- 



(20) 



to the Fokker-Planck equation. In a physical interpre- 
tation, this term describes for each speaker i an in- 
dependently diffusing particle, albeit with a spatially- 
dependent diffusion constant, in the V— 1-dimcnsional 
space < xn+Xi2 + - ■ • + Xi l v-i < 1. On the boundaries 
of this space, one finds there is always a zero eigenvalue 
of the diffusion matrix that corresponds to the direction 
normal to the boundary. This reflects the fact that, in 
the absence of bias or interaction with other speakers, it 
is possible for a variant to fall into disuse never to be 
uttered again. These extinction events are of particular 
interest, and we investigate them in more detail below. 

The third and final contribution to (18) comes from 
speakers retaining a record of other's utterances. This 
leads to different speakers' grammars becoming coupled 
via the interaction term 



v-i 



/i(mt) 

'-a 



OX j/l; 



d 

dx ^ u 



(Xi v Xj v ) . (21) 



We end this section by rewriting the Fokker-Planck 
equation as a continuity equation in the usual way: 



dP/dt + J2 hV dJ lv /dx lv = [29, 30], where 



w — 1 

V-1 



I d 

Typ ^ ^ ~f\ {Xi v S v ^ w — Xi v Xi w ^) P{X^ t) 



2T ^ dx 

W — l 

- Gijhij ( x iv ~ x jv ) P(X, t) , (22) 

is the probability current. The boundary conditions on 
the Fokker-Planck equation with and without bias differ. 
In the former case, the boundaries are reflecting, that is, 
there is no probability current flowing through them. In 
the latter case, they are so-called exit conditions: all the 
probability which diffuses to the boundary is extracted 
from the solution space. The result (22) will be used in 
subsequent sections when finding the equations describ- 
ing the time evolution of the moments of the probability 
distribution. 



VI. FISHER- WRIGHT POPULATION 
GENETICS 



The Fokker-Planck equation derived in the previous 
section is well-known to population geneticists, being a 
continuous-time description of simple models formulated 
in the 1930s by Fisher [31] and Wright [32]. Despite crit- 
icism of oversimplification (see, e.g., the short article by 
Crow [33] for a brief history) , these models have retained 
their status as important paradigms of stochasticity in 
genetics to the present day. Although biologists often 
discuss these models in the terms of individuals that have 
two parents [34, 35] , it is sufficient for our purposes to de- 
scribe the much simpler case of an asexually reproducing 
population. 

The central idea is that a given (integer) generation 
t of the population can be described in terms of a gene 
pool containing K genes, of which a number k v have al- 
lele A v at a particular locus, with J2v=i K = V and 
v = 1, . . . , V. In the literature, an analogy with a bag 
containing K beans is sometimes made, with different 
colored beans representing different alleles. The next 
generation is then formed by selecting with replacement 
K genes (beans) randomly from the current population. 
This process is illustrated in Fig. 6. The replacement is 
crucial, since this allows for genetic drift — i.e., changes 
in allele frequencies from one generation to the next from 
random sampling of parents — despite maintaining a fixed 
overall population size. 

The probability of having k' v copies of allele A v in gen- 
eration t + 1, given that there were k v in the previous 
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FIG. 6: Fisher- Wright 'beanbag' population genetics. The 
population in generation t + 1 is constructed from generation 
t by (i) selecting a gene from the current generation at ran- 
dom; (ii) copying this gene; (iii) placing the copy in the next 
generation; (iv) returning the original to the parent popula- 
tion. These steps are repeated until generation t + 1 has the 
same sized population as generation t. 

generation, is easily shown to be multinomial, i.e., 
P(k[, k' 2 , ...,k' v ;t+ l\kx, k 2 , ...,k v ;t) = 

*! (h) kl ( k A k2 ...( k A kv (23) 

yy-wW \k) \k) ■ { ' 

Using the properties of this distribution (see Appendix), 
it is straightforward to learn that the mean change in the 
number of copies of allele A v is the population from one 
generation to the next is zero. If we introduce x v (t) as the 
fraction k v /K of allele A v in the gene pool at generation 
t, we find that the second moment of this change is [34] 

([x v (t + l)-x v (t)][x w (t + l)-x w [t)]) = 

— (x v (t)S VlW -x v (t)x w (t)) . (24) 
ZK 

By following the procedure given in the Appendix, one 
obtains the Fokker- Planck equation 

dP(x,t) _ 1 d 2 _ . 

a, / , a n \X V VW X v X w )r\X,l) \Zo) 

at 2K ^-^ ox v ox w 

v,w 

to leading order in 1/K . Since one is usually interested in 
large populations, terms of higher order in 1/K that in- 
volve higher derivatives are neglected. Thus one obtains 
a continuous diffusion equation for allele frequencies valid 
in the limit of a large (but finite) population. 

We see by comparing the right-hand side of (25) with 
(20) that the Fisher- Wright dynamics of allele frequencies 
in a large biological population coincide with the stochas- 
tic component of the evolution of a speaker's grammar. 
Because of this mathematical correspondence, it is use- 
ful occasionally to identify a speaker's grammar with a 
biological population. However, as noted at the end of 
Sec. Ill, this should not be confused with the population 
of utterances central in our approach to the problem of 
language change. 



As we previously remarked, the fact that a speaker 
retains a record of her own utterances means that the 
grammar of a single speaker will be subject to drift, even 
in the absence of other speakers, or where zero weight 
Hij given to other speaker's utterances. In this case, a 
single speaker's grammar exhibits essentially the same 
dynamics as a biological population in the Fisher- Wright 
model. We outline existing results from the literature, 
as well as some extensions recently obtained by us, in 
Sec. VII below. 

The requirement that the population size K is large 
for the validity of the diffusion approximation (25) of 
Fisher- Wright population dynamics relates to the large- 
T approximation of Sec. VA. By contrast, the small-A 
approximation relates to an ageing population, i.e., one 
where a fraction A/(l + A) of the individuals are replaced 
in each generation. This is similar to a Moran model in 
population genetics [36], in which a single individual is 
replaced in each generation. Its continuous-time descrip- 
tion is also given by (25) but with a modified effective 
population size K . 

It is worth noting that when production biases are 
present, i.e., the parameters m vw are nonzero, the result- 
ing single-speaker Fokker-Planck equation corresponds to 
a Fisher- Wright process in which mutations occur [34]. 
In the beanbag picture, one would realize this mutation 
by having a probability proportional to m vw of placing 
a bean of color v in the next population, given that the 
bean selected from the parent population was of color w. 
It is again possible to obtain exact results for this model, 
albeit for a restricted set of mutation rates. We discuss 
these below in Sec. VII. 

The remaining set of parameters in the utterance selec- 
tion model, hij, correspond to migration rates from pop- 
ulation j to i in its biological interpretation. It is appar- 
ently much more difficult to treat populations coupled in 
this way under the continuous-time diffusion approxima- 
tion. A prominent exception is where one has two popula- 
tions: a fixed mainland population and a changing island 
population [34] . The assumption that the mainland pop- 
ulation is fixed is reasonable if it is much larger than the 
island population. Since a speaker's grammar does not 
have a well-defined size, this way of thinking is unlikely 
to be of much utility in the context of language change. 
Therefore in Sec. VIII we pursue the diffusion approx- 
imation where all speakers (islands) are placed on the 
same footing. This work contrasts investigations based 
on ancestral lineages ( "the coalescent" ) that one can find 
in the population genetics literature (see, e.g., [37] for a 
recent review of applications to geographically divided 
populations). We shall also make use of these results to 
gain an insight into the multi-speaker model. 

Finally in this section we note that a feature ubiquitous 
in many biological models, namely the selective advan- 
tage (or fitness) of alleles, is not relevant in the context of 
language change. For reasons we have already discussed 
in Sec. II, we do not consider lingueme variants to possess 
any inherent fitness. 
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VII. SINGLE-SPEAKER MODEL 

We begin our analysis of the utterance selection model 
by considering the case of a single speaker which is non- 
trivial because a speaker's own utterances form part of 
the input to her own grammar. We outline both rele- 
vant results that have been established in the population 
genetics literature, along with an overview of our new 
findings which we have presented in detail elsewhere [6]. 
We begin with the case where production biases (muta- 
tions) are absent. 



A. Unbiased production 

When the probability of uttering a particular variant 
form v is equal to the frequency x v stored in the speaker's 
grammar (we drop the speaker subscript i as there is only 
one of them), the Fokker-Planck equation reads 



8P(x, i) 
dt 



v-i v-i 



or 51 51 



if 



2T ^ ^ dx v dx x 

V = l W = l 



-x v x w )P (26) 



where V is the total number of possible variants. We see 
that in this case, T enters only as a timescale and so we 
can put T = 1 with no loss of generality in the following. 

One way to study the evolution of this system is 
through the time-dependence of the moments of x v . Mul- 
tiplying (26) by x v (t) k and integrating by parts one finds 
[6] ' 



d(x v (t) k ) _ fc(fc-l) 



dt 



(xvitf"- 1 )) - (x v (t) k ) . (27) 



We see immediately that the mean of x v is conserved 
by the dynamics. The higher moments have a time- 
dependence that can be calculated iteratively for k — 
2,3,.. .. For example, for the variance one finds that 

(x v (t) 2 ) - {x v {t)f = x Vt0 (l - x vfi )(l - e _t ) . (28) 

Remarkably — and as we showed in [6] — the full time- 
dependent solution of (26) can be obtained under a suit- 
able change of variable. The required transformation is 



(29) 



i-E 



j<i x j 



which maps the space < x\ + X2 + ■ ■ ■ + xv-i < 1 onto 
the V— 1 dimensional unit hypercube, < it, < IVi. In 
the new coordinate system the Fokker-Planck equation 
is [6] 



The solution is then obtained by separation of variables. 
First, we separate the time and space variables so that 
given a fixed initial condition u one has 



P(u,t) 



C\ v {u )$ Xv (u)e 



-Xvt 



(32) 



Here, A and $a v (u) are the eigenvalues and correspond- 
ing eigenfunctions of the operator T>y appearing in (30) , 
and C\ v ({to) a set of expansion coefficients that are de- 
termined by the initial condition. 

One can then separate each of the u variables, since we 
have the recursion 



V v+1 (ui, . . .,u v ) = T> 2 (ui) + 



1 — i*i 



1 ^ , . 

, ,u v ) . 
(33) 

To see this, let us assume we have found an eigen- 
function $\ v (u±, . . . , Uy-i) of the I^-variant operator 
T>v(ui, . . . ,uv-i) with accompanying eigenvalue Ay. 
Now, we make an ansatz 

®\ v+1 (ui, ■ ■ -,u v ) = i>\ v+1 .\ v (ui)§x v {u 2 , ■ ■ -,u v ) 

(34) 

for an cigenfunction of the V+l- variant operator 
£V+i(wi, ■ • ■ i uv), where the corresponding eigenvalue 
\v+i remains to be determined. Inserting this ansatz 
into (33) yields the ordinary differential equation 



1 d2 n 
2d^ u(1 - 



Ax 



w+i 



l-u 



■(«) 



(35) 

that has to be solved for the function ip. Note that when 
V = 2, we have only one independent variable u\ and the 
eigenfunction of £> 2 (wi) with eigenvalue A2 is the solution 
of (35) with Ai = 0. Beginning with this case in (34) 
and iterating the requisite number of times, one finds 
the solution for an eigenfunction of the V- variant Fokker- 
Planck equation is 

(36) 

That is, the partial differential equation (30) is separa- 
ble in the variables Ui as claimed, and each factor in the 
product is a solution of the ordinary differential equation 
(35) that contains two parameters. After an appropri- 
ate substitution, (35) can be brought into a standard 
hypergcomctric form whose solutions are Jacobi polyno- 
mials [38] . This analysis [6] yields the eigenvalues of the 
Fokker-Planck equation. When there are initially V vari- 
ants these arc 



Ay — T^v-iiLy-i + 1) • 



L V = J2(£ W + 1) (37) 



dP(u,t) 

at 



V-l 



d 2 u v {l-u v ) 



i^d<Y[ w<v {i-u w y (30) 

= V v {u u ...,u V - 1 )P . (31) 



in which the variables t w are non-negative integers. 

Note that all the eigenvalues are positive: that is, the 
function P(u, t) decays over time. This is because of the 
fact that, when no production biases are present, once 
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a variant's frequency vanishes, it can never be uttered 
again: i.e., variants become extinct until eventually one 
of them becomes fixed. Hence, the stationary probability 
distribution comprises delta functions at the points where 
one of the frequencies x v = 1. Since the mean of the dis- 
tribution is conserved (see above), the weight under each 
delta function — which is the probability that the corre- 
sponding variant is the only one in use as t — > oo — is 
simply the variant's mean frequency in the initial con- 
dition. Although we do not give the solution explicitly 
here, it is plotted for a two variant unbiased system in 
Figure 7. The distribution in the interior of the domain 
decays with time, as the probability of one variant being 
eliminated (not plotted) grows. 




FIG. 7: Time development of the exact solution of the Fokker- 
Planck equation for a single speaker with two variants ini- 
tially, when bias is absent and xq = 0.7. 



is given by 

1 cx 

f v (x ,t) = x v . - - ^2(-l) e [P i+1 (l - 2x v . ) - 
1 e=i 

P^ 1 (l-2x vfl )]e^+ 1 ^ 2 (38) 

in which Pg(z) is a Legendre polynomial. Several other 
results can be obtained by utilizing the above reduction 
to an equivalent two- variant problem together with com- 
binatorial arguments. For example, the probability that 
exactly r variants coexist at time t may be expressed 
entirely in terms of the function / and various combina- 
torial factors [6]. 

Other quantities, such as the mean time to the rth 
extinction, or the probability that a set of variants be- 
come extinct in a particular order, can be most easily 
found from the backward Fokker-Planck equation [29], 
which involves the adjoint of the operator £,\ lcp \ In some 
cases, one can carry out a reduction to an equivalent two- 
variant problem wherein such quantities as the mean time 
to fixation of a variant v averaged over those realizations 
of the dynamics in which it does become fixed [40] 

r v = _ 2 (l-*«.o)h(l-*«M>) (39) 

X v ,0 

come into play. Note, however, that this reduction is not 
always possible. For instance, in the two examples given 
at the start of this paragraph, the former can be calcu- 
lated from such a reduction, whereas the latter cannot. 
These subtleties are discussed in [6]. 



B. Biased production 



It is remarkable that the solution of the Fokker-Planck 
equation for V variants is not much more complicated 
than the solution of the corresponding equation for 2 vari- 
ants. This turns out to be a feature of other quantities 
associated with this problem. For example, the probabil- 
ity f v (xo,t) that variant v is the only one remaining at 
a finite time t, given an initial condition xq, can be cal- 
culated rather easily because a reduction to an effective 
two-variant problem can be found to work in this case as 
well. To understand this idea, it is helpful to return to 
the beanbag picture of population genetics of the previ- 
ous section. We are interested in knowing the probability 
that all beans in the bag have the same color — say, for 
concreteness, chartreuse. Let then x be the fraction of 
such beans in the bag in the current generation. In the 
next generation, each bean has a probability x of being 
chartreuse, and a probability 1 — x of being some other 
color. Clearly, the number of chartreuse beans in the 
next generation has the distribution (23) with V = 2, 
which is the reduction to the two-variant problem. The 
form of / in this case was first found by Kimura [39] and 



We turn now to the case where the production proba- 
bilities and grammar frequencies are not identical, but re- 
lated by (3). Here, calculations analogous to those above 
are possible in those cases where m vw = m v . That is, in 
the interpretation where m vw are mutation rates, we can 
obtain solutions when mutation rates depend only one 
the end product of the mutation. 

To calculate moments of x v (t) it is most efficient to 
use the Fokker-Planck equation in the form dP/dt + 
^2 v dJ v /dx v — and the explicit formula for the cur- 
rent (22) adapted to the single-speaker case to find the 
equation satisfied by the moments: 

d(x v (t) k ) f k 8P(x,t) 

— s~ = J dxx v^r- 

= -^2 [ dxx k v ^ = k [ dxxi k -^ J v (x,t),(40) 
w J ux w J 

using the condition that the current vanishes on the 
boundary. Using Eq. (22) the equation for the first mo- 
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ment, for instance, is 
d(x v (*)) 



= - (m w (x v ) - m v (x w )) 

= I - mj (x v ) + m v (1 - (ar„)) 
= m„ - #(a;„) (41) 



in which i? = J2 v =i m v- This has the solution 
(xv(t)) = ^-+[x vfi -—je 



-Rt 



(42) 



a result that does not depend on the number of to- 
kens exchanged per interaction since this affects only the 
stochastic part of the evolution. Higher moments have 
more complicated expressions which can be found in [6]. 

Once again, we can find the complete time-dependent 
solution of the Fokker-Planck equation using the same 
change of variable and separation of variables as before. 
To achieve this, one makes the replacement 

Id Id 

2dv~ Uv ^~ Uv ^ ~* ^-Q^- u v( l - u v)+( R v u v-m v ) (43) 
in Eq. (30) and where we have introduced 



0.2 0.4 0.6 0.8 1 

x 



FIG. 8: The stationary distribution with one speaker and two 
variants for mi = mi = 0.2. 



(44) 




Note that it is necessary to reinstate the parameter T 
since two timescales are now in operation: one corre- 
sponding to the probabilistic sampling effects, and the 
other to mutations. In the ensuing separation of vari- 
ables, we find that each product ip in the eigenfunction 
analogous to (36) picks up a dependence on the variant 
v through the parameters m v and R v . The eigenvalue 
spectrum also changes, becoming now 

V = ^rpL' v _ 1 (2TR + L' M _ X - 1) , L' v = j2l w (45) 

w — 1 

where £ w are, as before, non-negative integers and R = 
Y^w=\ m w- On this occasion, we have a zero eigenvalue 
when £ w = OVw. The corresponding eigenfunction is 
then the the (unique) stationary state P*(x) which is 
given by 



P*(x)=T(2R)l[ 



„2Tm„ 



r(2m„) 



(46) 



This result first appeared for the case V = 2 in Rcf. [32]. 

When V = 2, this is a beta distribution. It is peaked 
near the boundaries when mi and m 2 are both less than 
1/2, as illustrated in Figure 8. When the bias parameters 
are greater than 1/2, the distribution is centrally peaked, 



FIG. 9: The stationary distribution with one speaker and two 
variants for mi = 0.8 and m,2 = 0.6. 



and is asymmetric when mi ^ m 2 , as can be seen in 
Figure 9. 

It is perhaps interesting to note that the probability 
current is zero everywhere in this steady state: i.e., that 
a detailed-balance criterion is satisfied. It seems likely 
that the more general situation where 711, yiu CcLll depend 
both on the initial and final variants will give rise to a 
steady state in which there is a circulation of probability. 
We believe a solution for this case has not yet been found. 

Finally in this survey of the single-speaker model we 
remark on the existence of a hybrid model in which some 
of the production biases are zero. Then, those variants 
that have x v = will fall into disuse, and the subsequent 
dynamics will be the same as for the case of biased pro- 
duction among that subset of variants to which mutation 
is possible. 



VIII. MULTI-SPEAKER MODEL 

Having established the basic properties of the single 
speaker model — moments, stationary distribution and 
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fixation times — we now seek their counterparts in the 
rather more realistic situation where many different 
speakers are interacting. The large number of poten- 
tial parameters specifying the interaction between speak- 
ers (Gij and hij) means the complexity of the multiple 
speaker model is much greater than that for a single 
speaker. However, some analytic results can be obtained 
by considering the simplest set of interactions between 
speakers, one where all the interaction probabilities and 
weightings are equal. That is, we set 

Gl ° = G= 2N(N-1) and kij = k V *' J ' (47) 

This greatly simplifies the situation, as the interactions 
between speakers are now identical, with different speak- 
ers being only distinguished by their initial conditions. 
From a linguistic point of view, it also seems natural to 
begin with all speakers interacting with the same proba- 
bility, as might happen in a small village [41, 42]. We are 
also not considering social forces here, and so we assume 
that Hij is constant. It can also be seen from the results 
for a single speaker that the majority of behaviors can 
be observed in systems with only two variants. There- 
fore we will not consider more than two variants for the 
remainder of this section. 

The Fokker-Planck equation (18) now takes the rela- 
tively simple form 



dt y 



(N 



- mi) 



^-Xx-fl 



2T dx_ 
P 



Xi) 



Note that the sum over j in this expression can be written 
as N(x) — (x^ where 



where we use x without a subscript to denote the over- 
all proportion of the first variant in the population 
x = ^ Ji Xi/N. The parameter mi is the bias parameter, 
mi = mi2, and R = mi + m 2 = mi 2 + m 2 i. Although 
we have not succeeded in solving this equation exactly, 
we have been able to perform a number of calculations 
and analyzes which we present below. 



(50) 



is the mean frequency over the entire community of 
speakers. 

Using this substitution, and summing (49) over all 
speakers, we find that 



dt 



(x) = —G(N - 1) (R(x) — mi) . 



Subtracting this expression from (49) gives 
d 



dt 



( Xi - x) = -G [(N - 1)R + Nh] ( Xi - x) 



(51) 



(52) 



These equations are now decoupled and their solution 
follows readily after implementing the initial condition 
and using the definitions (47). We find that 



(*<(*)} = -£ + [( x o - -£-) 



+ (x h0 -xo)e- ht ^ N -V 

, ... mi 
<*(*)} = 

where x = x(0) = jjJ2i x i,o- 



-Rt/2N 



R 



(53) 
(54) 



+hj^-(xi - j^iY^jjn x j)} 

+ T^lh£ l {x i -x)}p (48) 
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A. Moments 

Differential equations for moments of Xi can be found 
using the same methods as before. When production 
biases are present we find, by multiplying (48) by Xi, in- 
tegrating and using the fact that the probability current 
vanishes at the boundaries, that (compare with Eq. (41)) 



( Xi ) = -(N-1)G 



(R + h){xi) - mi 



N - 



(49) 



FIG. 10: The time development of the mean of a single 
speaker (xi) for two different choices of mutation parame- 
ters. In each case a?i,o = 0.7, N = 10 and h — 0.5. T = 1. 
The overall population mean (a;) is shown as a dashed line for 
comparison, with xo = 0.3. 

Each speaker's mean thus converges to the commu- 
nity's mean at a rate controlled by h, and the latter re- 
laxes to the fixed point of the bias transformation M at 
a rate determined by R. In both cases, the decay time 
grows linearly with the number of speakers N. This be- 
havior is shown in Figure 10 in which the time develop- 
ment of the mean of a particular speaker has been plotted 
for two different bias parameter choices. 
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In the unbiased case we can repeat the same procedure 
to find the time dependence of (xi). The result is simply 
(53) and (54) with R and mi set to zero, though one 
must be careful with the boundaries when deriving the 
equivalent of (49). In particular 



(x,(t))= I0 l(x,,,-,„) e -« , 



(55) 



and we see explicitly that the expected overall fraction 
of each variant in the population is conserved, just as in 
the single speaker case: 



(x(t)) = XQ 



(56) 



Although we could write time dependent equations for 
higher moments, they are much more complicated. In- 
stead we now turn to the stationary distribution. 



B. Stationary distribution 

In the absence of production biases, the stationary dis- 
tribution is one in which all speakers' grammars contain 
only one variant. This is similar to the situation for a 
single speaker, only we should note that (except in the 
special case of h = 0, which is equivalent to the single 
speaker case) equilibrium is only reached when all the 
speakers have the same variant. Since (x(t)) is conserved 
by the dynamics, we have once again that the weight 
under the delta-function peaks equal to the initial mean 
frequency of the corresponding variants within the entire 
community. In the next subsection, we shall investigate 
the relaxation to this absorbing state of fixation. 

When production biases are present, we expect an ex- 
tended stationary distribution with a mean given by (54) 
in the t — > oo limit. The second moments can be cal- 
culated by multiplying Eq. (48) by xf and XiXj, i ^ j, 
integrating, and using the fact that the probability cur- 
rent vanishes at the boundaries, just as in the derivation 
of Eq. (49), except that in this case there is no time 
derivative. Using the symmetry of the speakers we find 
that 

R + h + ^j {xD* - (mi + ( Xi y - h{x iXj )* = 

(57) 

[(N - 1)R + h]{x iXj )* — (N - l)mi(xi)* - h{x1)* = 

(58) 

where the asterisk denotes the steady state. Solving gives 



mi f (N -l)Rm + h[(N -l)mi+m] 



(xtr = 



R { (N - 1)RR + h[(N - l)R + R] 
and, for i ^ j, 



(59) 



mi 



(XiXj) — 



(N - l) mi R + h[(N - l)mi + m] 



R 1 (N - 1)RR + h[(N - l)R + R] 



where m = mi + 1/2T, R = R + 1/2T. For the overall 
proportion of the first variant 



1,3 



m x 
= NR X 

{N-l)Rm+ {N-l) 2 m l R + Nh[{N-l) mi + m] 



(N - 1)RR + h[{N - 1)R + R] 



(61) 



where the sum on the first line now includes the case 
i = j- 

When there are only two variants, the single speaker 
stationary distribution (46) is a beta distribution. The 
marginal distribution for each speaker in the multiple 
speaker model is modified by the presence of other speak- 
ers, but still the distribution is peaked near the bound- 
aries when the bias is small, and changes to a centrally 
peaked distribution as the bias becomes stronger. We 
therefore propose that it is appropriate to approximate 
the stationary marginal distribution as a beta distribu- 
tion with mean and variance just calculated. That is, 



P*(*>) 



Tja + (3) 



where 



a = 2Tmi 



(N-l)R+hN 



(3 = 2T(R-m 1 ) 



(N-l)R + h 

(N — 1)R + hN 



(N - 1)R + h 



(62) 

(63) 
(64) 



(60) 



Unlike in (46) the parameters of the distribution now 
depend on h and N as well as m v . The marginal dis- 
tribution is well fitted by this beta distribution for a 
broad range of h and N. An example is shown in Fig- 
ure 11, where the distribution calculated from simula- 
tions is compared to an approximating beta distribution. 

When N and h are small, the transition from concave 
to convex shape occurs at approximately the same val- 
ues of the mutation parameters as it does in the single 
speaker case, when mi = mi — 0.5. As N or h become 
larger, the transition value becomes smaller. For suffi- 
ciently large N or h, individual speakers will retain sig- 
nificant proportions of both variants, even for very small 
(but still non-zero) bias parameter values; the distribu- 
tion will be centrally peaked unless mi and mi are ex- 
tremely small. This can be seen in Figure 12, which 
shows the value of m = mi = m 2 at which the transi- 
tion from concave to convex takes place for a range of h 
and three different population sizes. This critical value 
of m, denoted by m c , is the value of m for which the 
parameters a and (3 in Eq. (62) pass through 1. 

The stationary distribution of x (the proportion of 
variant 1 throughout the population of speakers) on the 
other hand, does not always have a simple shape. Con- 
sider first when the mutation strength is fixed at some 
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Gaussian distributions calculated from the mean and sec- 
ond moment fit well — see Figure 13. The value of h only 




FIG. 11: The single speaker marginal stationary distribution 
when N = 10, h — 0.2A and mi = 7712 = 0.2. Bars are the 
distribution obtained from simulation, while the curve is the 
approximate beta distribution. 





FIG. 12: The mutation value m c at which the stationary pdf 
of Xi changes from a concave to a convex distribution, as a 
function of h for N = 2, N = 10, N = 100. Mutation is 
assumed symmetric: mi = mi. 



small value: mi = m 2 <C 0.5. When h is small some 
speakers can be at opposite ends of the interval. For small 
N, this leads to a multiply peaked distribution, with each 
peak representing a certain fraction of the speakers being 
at one end. As h gets larger, the tendency to be at the 
same end increases, and the central peaks dwindle, leav- 
ing the familiar double-peaked distribution. This only 
holds so long as the mutation strength remains below 
the critical value m c , as shown in Figure 12. For suffi- 
ciently large h or for larger N, the distribution becomes 
centrally peaked. 

When mi and m 2 are above the critical value, or if N 
is sufficiently large that the central-limit effect becomes 
significant, the stationary distribution of x is smooth and 
single peaked for all values of h, becoming more bell 
shaped the higher the value of N in accordance with the 
central limit theorem. Here wc find that both beta and 



FIG. 13: The average speaker stationary distribution when 
N = 10, h = 0.2A and mi = m,2 = 0.2. Bars are the dis- 
tribution obtained from simulation, while the curve is the 
approximate beta distribution. 



C. Fixation times 

In the calculations of Sec. VIII A we established that 
a single speaker's mean converges to the overall commu- 
nity's mean more slowly as the number of speakers is 
increased. When production biases are absent, we can 
also anticipate that the time to reach fixation also in- 
creases with the number of speakers. This fact can be 
established analytically by re-casting the description of 
the system in terms of the coalescent, a technique which 
can be found in [43, 44]. We will not give the details of 
this calculation here, but merely state the result, which 
is derived in [45]. The mean time to extinction of the 
second variant, which corresponds to fixation of the first 
is 



r 2 [A(0)] = 



l-^o 
x 



N(N-l) 
2h 



F[X(0)}-TN 2 In(l-xo) 
(65) 

Note that the second term is of the same form as (39). 
The function F depends on the initial distribution of 
speaker's grammars. For example, when all the speakers 
start with the same initial proportion (aj»(0) = xo Vi), 



N-l 



F[X(Q)] = ]T 



m 



x 1 



„N-1 



N 1-xq 



(66) 



while when M — Nx of the speakers start with Xi = 
1 and N — M start with Xi — (so that the overall 
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FIG. 14: The mean time to fixation to each boundary as a 
function of h, for a system with 20 speakers and xo = 0.3. 
The solid curves are for an inhomogeneous initial condition, 
and the dashed curves are for a homogeneous initial condition. 
The lower curves are T2 and the upper curves are t\ . 



proportion is still the same), 



M (M 



F(X(0)) = £ 



( ) 1 



^ INS 

= 1 W 



m 



(67) 



These are perhaps the extreme possibilities for the dis- 
tribution, and in fact the values of F differ little. For 
large N they are virtually the same and both are well 
approximated by 



F(X(0)) ~-ln(l-s ) 



(68) 



which gives the much simpler expression for the mean 
time to extinction of the second variant 



T 2 



XQ 



ln(l - x ) 



N(N — 1) 
2ft 



+ TN 2 



(69) 



that appeared in [44]. Figure 14 shows the mean time 
to fixation at each boundary (n and r 2 ) for a system 
with only 20 speakers. Already the times for inhomoge- 
neous (solid lines) and homogeneous (dashed lines) are 
very similar. Notice also the dramatic increase in the fix- 
ation time as ft becomes smaller. To calculate the mean 
to to fixation of any variant, we take a weighted average 
of the time for each variant: 



t = x t 2 + (l - xo)n 

- -[(1 - iro)ln(l - x Q )+x ln(x ) 



N(N-l) _ Ar2 



2ft 



+TN< 



(70) 



whereas the moments were seen to relax with time con- 
stants that grow linearly with N. These results relate to 
the qualitative behavior observed in simulation. One no- 
tices the initial condition relaxes quickly to one in which 
speakers have a distribution that persists for a long time 
until a fluctuation causes extinction of a variant. The na- 
ture of this distribution depends on the size of ft. When it 
is very small, the attraction of speakers to the boundaries 
is stronger than that to the other speakers. Therefore, 
some speakers dwell near the x = boundary, others 
near the x = 1 boundary and only a few being in the 
central part of the interval at any one time. Here it is 
evident that for fixation to occur, one needs all speakers 
near one of the boundaries thus explaining why the fix- 
ation time is so much longer than the initial relaxation. 
For larger ft, the attraction between speakers overcomes 
the tendency to approach the boundaries, so the speakers 
tend to dwell in the interior of the interval. 




D. Quasi-stationary distribution en- route to 
fixation 

An interesting feature of the fixation time is that it 
increases quadratically with the number of speakers N, 



FIG. 15: The distribution of speaker grammar values over 
a time series, for (top) an ensemble of realizations (none of 
which reach fixation during the period shown) and (bottom) 
the analytic beta distribution approximation, both for N = 20 
and h = 0.2A. 
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FIG. 16: The number of realizations remaining unfixed at 
time t, with initially 1000 realizations. Dashed curve is 
1000e" t/T where r is given by Eq. (70) 

We shall concentrate on the quasi-stationary distribu- 
tion with h small. We obtain this using a mean-field ar- 
gument, expected to be valid for large N. As usual when 
applying mean- field theory we focus on one constituent, 
in this case speaker i. We then replace the term involving 
all the other speakers in the Fokker-Planck equation by 
an average value. Thus Eq. (48), in the unbiased case, 
becomes 

|p = { N-l)GE i {4r£j*i(l-Zi) 

+h£:(x i - (x))} P. (71) 

The solution to this equation is separable, so we write 
P(X,t) = Y\iP{xi,t), and find the Fokker-Planck equa- 
tion for a single speaker to be 

^p(xi,t) =(N-l)G{£:(hx i -h(x)) 

+ 4 r £;X i {l-x i )} P {x i ,t). (72) 

After a rescaling of time t — * (N — l)Gt, and dropping 
the index i, this is exactly the Fokker-Planck equation 
for a single-speaker with bias and two variants, with the 
identification h — > R and h(x) — > mi. At large times 
we have from (55) that (xi) = x = 2^0 • Therefore we 
expect that at large times the solution of the Fokker- 
Planck equation to be identical to that of the single- 
speaker Fokker-Planck equation with bias, as long as the 
identification R — > h and mi — > hxo is made. In particu- 
lar, we expect the marginal probability distribution for a 
single speaker to have a stationary form which is a beta 
function of the form (46) with V = 2 and 2Tmi -» 2Thx 
and 2Tm 2 -> 2Tft(l - x ), that is, 

Punfixed(Si) - T{ j T { P p_ JTf rl{1 X ^ (P ^ 1 ' ( ?3 ) 



0.8. ... 



0.6, 




0.8. ... 




FIG. 17: The distribution of speaker grammar values over a 
time series, for (top) an ensemble of realizations (including 
fixing realizations) and (bottom) the analytic beta distribu- 
tion approximation, both for N = 20 and h = 0.2A. 

where 

p = 2Th and p = 2Tx Q h . (74) 

This distribution is shown in the lower half of Figure 15 
for the case of h small. In the upper half of this figure 
is the equivalent distribution calculated from numerical 
simulations, and it can be seen that the shape is main- 
tained over time (the numerical result only includes real- 
izations that do not fix in the time period specified) , and 
that it is very similar to the beta approximation. 

If we assume that the rate at which any individual re- 
alization of the process becomes fixed is constant, the 
number of unfixed realizations exhibits an exponential 
decay with a time-constant r given by (70). That this 
is the case is suggested by Fig. 16 in which the num- 
ber of unfixed realizations as a function of time obtained 
from Monte Carlo simulation is compared with this pre- 
diction. This then suggests for the full time-dependent 



18 



distribution the expression 

^■^ rw^) <" 1(1 -^'" le '' /M75) 

In Figure 17 we compare this approximation, shown in 
the lower half, with numerical results in the upper half 
(where now the numerical results include realizations 
that fix during the time interval). 

IX. DISCUSSION AND CONCLUSION 

In this paper we have cast a descriptive theory of lan- 
guage change, developed by one of us [3-5], into a mathe- 
matical form, specifically as a Markovian stochastic pro- 
cess. In the resulting model there are a set of N speakers 
who each have a grammar which consists of V possible 
variants of a particular linguistic structure (a lingueme). 
In the initial phase of formulating the process, two speak- 
ers out of the N are picked out at every time step and 
allowed to communicate with each other. The utterances 
they produce modify the grammar of the other speaker 
- as well as their own — by a small amount. Another 
two speakers are then picked at the next time step and 
allowed to communicate. This process is repeated, with 
two speakers i and j being chosen at each time step with 
a probability Gij. This matrix therefore prescribes the 
extent of the social interaction between all speakers. 

After many time steps the initial grammar of the 
speakers will have been modified in a way which depends 
on the choice of the model parameters. The above for- 
mulation, that is, in terms of events which happen at 
regular time steps, is ideal for computer simulation. Of 
course, the model is stochastic, and so many independent 
runs have to be carried out, and the results obtained as 
averages over these runs. The randomness in the model 
enters in two ways: in the choice of speakers i and j and 
in the choice of the variants spoken by a speaker in a 
particular utterance. We showed that it is possible to 
take the time interval between steps to zero, and derive 
a continuous time description of the process. When this 
procedure is carried out, the model takes the form of a 
Fokker-Planck equation. 

The whole approach to language change we have been 
investigating was conceived as an evolutionary process, 
with lingucmcs being analogous to genes in population 
genetics. So it is perhaps not surprising that the math- 
ematical structures encountered when quantifying these 
theories are so similar. However, as stressed in Sec. VI, 
there are important differences. The most direct corre- 
spondence with population genetics is when there is a sin- 
gle speaker and where the number of tokens is large. Fur- 
thermore, at each time step the update rule (6) applies in 
the linguistic model, whereas the equivalent rule in the 
population genetics case would be x(t + St) = K~ 1; n(t) 
corresponding to a completely new generation of K in- 
dividuals being created through random mating. Thus 
the genetic counterpart is formally equivalent to letting 



A — > oo, and giving the previous grammar (x(t)) zero 
weight compared to the random element (n(t)); for the 
actual linguistic problem, A is small, and it is x(t) that 
has by far the greater weight. Taking A — > oo and rein- 
stating the factor of T through a rescaling of the time, 
does indeed give the population genetics result (25), with 
K taking the role of T. Although the limit A — > oo is 
the precise correspondence, the scaling choice (15)— (17) 
which we use also gives a mathematical, if not a precise 
conceptual, equivalence between the genetic and linguis- 
tic models. 

Our analysis of the Fokker-Planck equation began by 
considering the case of only one speaker. This is far from 
trivial, and as we have seen is formally equivalent to stan- 
dard models of population genetics. This has the advan- 
tage that many results from population genetics may be 
taken over essentially without change. Remarkably, the 
Fokker-Planck equation is in this case exactly soluble. 
This is due to the simple way in which the equation for 
V variants is embedded in the (V + Invariant equation. 
A similar simplification holds when calculating quantities 
such as the probability that a given number of variants 
coexist at time t or the mean time to the nth extinc- 
tion of a variant: they can be related by induction to the 
solution of the two- variant problem. 

While the exact solution of the mathematically non- 
trivial single speaker case gives considerable insights into 
the effects caused by the bias (or mutation) term (19) 
and the diffusion term (20), to understand the evolu- 
tion of variants across a speech community it is clearly 
necessary to include the third term (21) in the Fokker- 
Planck equation. In Sec. VIII we carried out an analysis 
of the model with this term included in the simplest situ- 
ation where all speakers were equally likely to talk to all 
other speakers (Gij independent of i and j) and where all 
speakers gave the same weight to utterances from other 
speakers (hij independent of i and j). Just as for the 
single speaker case, there are distinctions between the 
situations where there is bias and where there is no bias. 
Whilst the presence of a bias (through the term (19)), 
makes the model more complicated, its behavior is in fact 
simpler than if there were no bias: the distribution of the 
probability of a variant in the population tends to a sta- 
tionary state which can be approximately characterized 
as a beta distribution. As we have seen, when no bias is 
present, interactions between the speakers causes them 
all to converge relatively quickly to a common marginal 
distribution which persists for a long time until a fluctua- 
tion causes the same variant to be fixed in all grammars. 
Under a mean-field-type approximation, valid in the limit 
of a large number of speakers, we established the form of 
this quasi-stationary distribution. 

In this paper, we have been primarily concerned with 
the mathematical formulation of the theory and begin- 
ning a program of systematic investigation of the model. 
We believe that we have laid the foundations for this 
study with the analysis we have presented, but clearly 
there is much left to do. In order to make connection with 
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observational data we will need to consider more realistic 
social networks through which linguistic innovations may 
be propagated — i.e., non-trivial Gij, as in Fig. 1. Bear- 
ing in mind the proposed importance of social forces that 
described in Sec. II, it will also be necessary to include of 
speakers or groups of speakers which may have more in- 
fluence on language change than others — i.e., non-trivial 
Hij . Many of these cases will only be amenable to anal- 
ysis through computer simulations, but it should be pos- 
sible to obtain some analytical results with, for example, 
a simplified network structure. However, it is clear that 
even without any further developments, some of our re- 
sults can be generalized. For instance, by proceeding as 
in Sec. VIII A, we can find that for general Gij and hij, 



d(xj) 
dt 



Gij h i:j ((xi) - (xj)) 



(76) 



and therefore that the rate of change of (x) = J2i( x i) 1S 
given by 



d(x) 
dt 



= J2J2 G ij( h ij- h 0i)( X i)- ( 7? ) 

i j^Li 

Therefore (x) is conserved not only when h is constant, 
as demonstrated in Sec. VIII A, but also when h^ is sym- 
metric. In fact, the result can be further generalized. If 
we define the net "rate of flow" by 



UJ,j 



{Gij h 



then Eq. (77) may be written as 

d(x) 



dt 



= ^2^i(xi) . 



(78) 



(79) 



So as long as Ui = for all i, which may be thought of 
as a kind of detailed balance condition, then the overall 
mean is conserved. Now if the mean is conserved, then 
the probability of a particular variant become fixed is 
simply its initial value. Therefore no matter what the 
network or social structure, if . Gijhij — Gijhji 
for all i, then this structure will have no effect on the 
probability of fixation. 

It is clear, however, that in general the further devel- 
opment of the model will necessitate the choice of a par- 
ticular network and social structure. As an example of 
this we have recently begun to analyze the model in the 
context of the formation of the New Zealand English di- 
alect, for which a reasonable amount of data is available 
[42, 46]. In particular these give some information about 
the frequencies with which different linguistic variables 
were used by the first generations of native New Zealand 
English speakers and their ultimate fate in the forma- 
tion of today's conventional dialect. Predictions from our 
model relating to extinction probabilities and timescales 
will play an important part in better understanding this 
data. More widely, we hope that the work presented here 
will underpin many subsequent applications and form a 
basis for a quantitative theory of language change. 
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APPENDIX A: DERIVATION OF THE 
FOKKER-PLANCK EQUATION 

In this Appendix we derive the Fokker-Planck equa- 
tion (18). The method is standard, and involves the 
calculation of the so-called jump-moments for the pro- 
cess under consideration [29, 30]. Since we have already 
sketched some of the background in Sec. V A for the sin- 
gle speaker case, let us begin with this simpler version of 
the model. 

Our starting point is the Kramcrs-Moyal expansion 



8P(x, t) 

Ft 



V-l 



d 



+ 



~E Q^Mx)P{s,t)} 

v — 1 v 

EI 



2 ^— ' ^— ' dx v dx 

v—l w—1 



{a vw (x)P(x,t)} 



+ ... . 



(Al) 



Here the dots represent higher order terms (which will 
turn out not to contribute) and the a functions are the 
jump moments 



a v (x) 



5t->0 Ot 

(Sx v (t)Sx w (t)) 



lim 

«->o 



St 



(A2) 
(A3) 



where Sx v (t) = x v (t + St) — x v (t) . The Kramers- Moyal 
expansion itself is derived from the assumption that the 
stochastic process is Markov together with a continuous 
time approximation [29, 30]. 

In the single speaker case we have already established 
a form for Sx v (t) (see Eq. (7)) and since the mean of the 
multinomial distribution (2) is simply 



(ri v ) — Tx v , 



(A4) 



a manipulation as in Eq. (12) and a rescaling as in 
Eqs. (15) and (16) leads to 



(8x v ) 



^ ^ {jTlyiuXutj 



m wv x v ) (St) + . . . 



(A5) 



where the dots indicate higher orders in St. There- 
fore, from Eq. (A2), a v (x) = Y^ w ^v( m vwX w - m wv x v ). 
To find the second jump moment, we need to consider 
(5x v (t)8x w (t)) , but from Eq. (7) we see that this is al- 
ready C(A 2 ), that is, 0{St). Therefore any terms in the 
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matrix M which vanish as St — ► do not contribute at 
this order. Since all off-diagonal entries and diagonal en- 
tries apart from 1, are of this form, M may be replaced by 
the unit matrix everywhere in this second order term, i.e., 
any bias can be neglected. Using Eq. (7) and Eq. (A4) 
with x' replaced by x, we obtain 



(Sx v 8x w ) 



^2 (St) ((n v n w ) 



(n v )(n w )) + ... . (A6) 



Now the variance of the multinomial distribution is given 

by 



(n v n w ) - {n v )(n w ) 



Tx v (1 x v 
Tx^.x*., 



w, 



v ^ w, 



(A7) 



and so once again replacing x' by x and using the def- 
inition of the jump moment (A3), we obtain Eq. (14). 
All higher jump moments vanish, since from Eq. (7) we 
see that the third and higher moments of Sx are at least 
0(A) 3 , that is, at least 0(Stf/ 2 . Therefore the Kramers- 
Moyal expansion is truncated at second order and we ob- 
tain the Fokkcr-Planck equation 



dP(x,t) Xr-^ d >r-^ 

= / J / J \TTl vw X w f^wvXy ) i\X : t) 

v—1 v w^v 



dt 



— V 



d 2 



2T ^ dx v dx, 

V.W 



)P(x,t). 



(A8) 



The derivation in the case of the full model with N speak- 
ers follows similar lines. Here X(t) = (xi(t), . . . , x/v(£)) 
is an N(V — 1) dimensional grammar variable whose com- 
ponents we have written as Xi v . It is sometimes conve- 
nient to replace the two labels {i,v} by the single one 
/ with / = 1,...,N(V - 1). Then Eqs. (A1)-(A3) in 
the derivation of the one-speaker case can be taken over 
by replacing v and w by / = {v,i} and J = {w, j} re- 
spectively. In the full utterance selection model, there is 
randomness both in the choice of speakers that interact 
in the interval St following time t and in the tokens they 
produce. 

The jump moments are derived from averages of prod- 
ucts of the quantity Sx j = xj(t + St) — xj(t). From (5) 
we find the analog of the one-speaker result (7) to be 



Sxi 



A 



1 + A(l + H k 



— Xi v -+- rtij y~^r 



(A9) 



for a speaker i given that speakers i and j have been 
already be chosen as the interacting pair in the time-step 
at t. 

The mean change in the grammar variable {Sxi V ) can 
then be determined by knowing that the mean of the 
multinomial distribution (2) is simply 



Then 

(Sx iv ) = 



( n iv) — Tx, it 



(A10) 



A 



1 + \(1 + Hij) 
= A 



[ x iv X iv + Hij ( x jv x iv)~\ 



^ ^ (-M vw x>i w AI wv x>i v ) -\- Hij (xj v Xi v ) 



+ 0(XHM, \ 2 H, \ 2 M) 



(AH) 



in which the second line was arrived from the first by 
using (3). Similarly, from the variance of the multinomial 
distribution 



(jliyfljw) (jliv ) (fljuj ) 

' Tx' iv (l - x' iv ) v = w,i = j 

-Tx' iv x' iw v^w,i = j (A12) 

i^j 



one finds 



(Sxi v Sxj w ) 



( x iySy : w x iv x iv 



0(X 2 H, X 2 M, A 3 ) 
(A13) 

if i — j and {Sx{ V Sxj W ) = otherwise. 

In order to have both a deterministic and stochas- 
tic part to the Fokker-Planck equation, we need both 
(Sxi v ) and (Sxi V Sxi W ) to be proportional to St in the limit 
St — > 0. One can verify that the only way this can be 
arranged is if one rescales the variables as in Eqs. (15)- 
(17), a choice which was motivated in more detail in Sec- 
tion VA. Then, only the leading terms in Eqs. (All) 
and (A13) remain in when one takes the limit St ^ 
in Eqs. (A2) and (A3). Furthermore, all higher jump 
moments vanish, as also discussed in Section VA, and 
the sum in Eq. (Al) terminates at the second moment. 
After substituting the jump moments into (Al) and aver- 
aging over all possible pairs of speakers, weighted by the 
interaction probabilities Gij, one finally arrives at the 
Fokker-Planck equation given in the main text, Eq. (18). 
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